Free English and Czech telephone speech corpus shared under the CC-BY-SA 3.0 license
Abstract
We present a dataset of telephone conversations in English and Czech, developed to train acoustic models for automatic speech recognition (ASR) in spoken dialogue systems (SDSs). The data comprise 45 hours of speech in English and over 18 hours in Czech. All audio data and a large part of the transcriptions were collected using crowdsourcing; the rest was transcribed by hired transcribers. We release the data together with scripts for data pre-processing and for building acoustic models using the HTK and Kaldi ASR toolkits. We also publish the trained models described in this paper. The data are released under the CC-BY-SA 3.0 license; the scripts are licensed under Apache 2.0. In the paper, we report on the methodology of collecting the data, on the size and properties of the data, and on the scripts and their use. We verify the usability of the datasets by training and evaluating acoustic models using the presented data and scripts.
Similar resources
Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition
We present two recently released open-source taggers: NameTag is free software for named entity recognition (NER) which achieves state-of-the-art performance on Czech; MorphoDiTa (Morphological Dictionary and Tagger) performs morphological analysis (with lemmatization), morphological generation, tagging and tokenization with state-of-the-art results for Czech and a throughput around 10-200K wo...
CzEng: Czech-English Parallel Corpus release version 0.5
We introduce CzEng 0.5, a new Czech-English sentence-aligned parallel corpus consisting of around 20 million tokens in either language. The corpus is available on the Internet and can be used under the terms of the license agreement for non-commercial educational and research purposes. Besides the description of the corpus, also preliminary results concerning statistical machine translation experim...
Model-free control of non-minimum phase systems and switched systems
This brief presents a simple derivation of the standard model-free control for the non-minimum phase systems. The robustness of the proposed method is studied in simulation considering the case of switched systems. This work is distributed under CC license http://creativecommons.org/licenses/by-nc-sa/3.0/ arXiv:1106.1697v1 [math.OC] 9 Jun 2011
POLYCOST: A telephone-speech database for speaker recognition
This article presents an overview of the POLYCOST database dedicated to speaker recognition applications over the telephone network. The main characteristics of this database are: large mixed speech corpus size (> 100 speakers), English spoken by foreigners, mainly digits with some free speech, collected through international telephone lines, and more than eight sessions per speaker.
Inter-Annotator Agreement on Spontaneous Czech Language
The goal of this article is to show that for some tasks in automatic speech recognition (ASR), especially for recognition of spontaneous telephony speech, the reference annotation differs substantially among human annotators and thus sets the upper bound of the ASR accuracy. In this paper, we focus on the evaluation of the inter-annotator agreement (IAA) and ASR accuracy in the context of imper...